NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Run-length compressed metagenomic read classification with SMEM-finding and tagging

https://doi.org/10.1101/2025.02.25.640119

Depuydt, Lore; Ahmed, Omar Y; Fostier, Jan; Langmead, Ben; Gagie, Travis (February 2025, bioRxiv)

Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification inO(r) space, whereris the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at leastLbetween a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity ofO(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available athttps://github.com/biointec/taggerunder the AGPL-3.0 license.
more » « less
Free, publicly-accessible full text available February 28, 2026
Movi: A fast and cache-efficient full-text pangenome index

https://doi.org/10.1016/j.isci.2024.111464

Zakeri, Mohsen; Brown, Nathaniel K; Ahmed, Omar Y; Gagie, Travis; Langmead, Ben (December 2024, iScience)

Full Text Available
PEEV: Parse Encrypt Execute Verify - A Verifiable FHE Framework

https://doi.org/10.1109/ACCESS.2024.3424420

Ahmed, Omar; Gouert, Charles; Tsoutsos, Nektarios Georgios (July 2024, IEEE Access)

Cloud computing has been a prominent technology that allows users to store their data and outsource intensive computations. However, users of cloud services are also concerned about protecting the confidentiality of their data against attacks that can leak sensitive information. Although traditional cryptography can be used to protect static data or data being transmitted over a network, it does not support processing of encrypted data. Homomorphic encryption can be used to allow processing directly on encrypted data, but a dishonest cloud provider can alter the computations performed, thus violating the integrity of the results. To overcome these issues, we propose PEEV (Parse, Encrypt, Execute, Verify), a framework that allows a developer with no background in cryptography to write programs operating on encrypted data, outsource computations to a remote server, and verify the correctness of the computations. The proposed framework relies on homomorphic encryption techniques as well as zero-knowledge proofs to achieve verifiable privacy-preserving computation. It supports practical deployments with low performance overheads and allows developers to express their encrypted programs in a high-level language, abstracting away the complexities of encryption and verification.
more » « less
Full Text Available
Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests;

Draesslerova, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Benjamin; Manzini, Giovanni; Navarro, Gonzalo (July 2024, SEA 2024)

Full Text Available
Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

https://doi.org/10.4230/LIPIcs.SEA.2024.10

Draesslerová, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Liberti, Leo (Ed.)
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can - build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; - for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes; - find the minimum and maximum values stored in that interval; - take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: - a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max; - a minimizer digest; - a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
more » « less
Full Text Available
SPUMONI 2: improved classification using a pangenome index of minimizer digests

https://doi.org/10.1186/s13059-023-02958-1

Ahmed, Omar Y.; Rossi, Massimiliano; Gagie, Travis; Boucher, Christina; Langmead, Ben (May 2023, Genome Biology)

Abstract Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
more » « less
Efficient taxa identification using a pangenome index

https://doi.org/10.1101/gr.277642.123

Ahmed, Omar; Rossi, Massimiliano; Boucher, Christina; Langmead, Ben (January 2023, Genome Research)

Full Text Available
Pan-genomic matching statistics for targeted nanopore sequencing

https://doi.org/10.1016/j.isci.2021.102696

Ahmed, Omar; Rossi, Massimiliano; Kovaka, Sam; Schatz, Michael C.; Gagie, Travis; Boucher, Christina; Langmead, Ben (June 2021, iScience)
null (Ed.)
Full Text Available
Instantaneous amplitude and shape of postrhinal theta oscillations differentially encode running speed.

https://doi.org/10.1037/bne0000416

Ghosh, Megha; Shanahan, Benjamin E.; Furtak, Sharon C.; Mashour, George A.; Burwell, Rebecca D.; Ahmed, Omar J. (December 2020, Behavioral Neuroscience)
null (Ed.)
Full Text Available
Micro-Compression of Freestanding Electroplated Copper Through-Glass Vias

https://doi.org/10.1109/TDMR.2020.2973940

Ahmed, Omar; Jalilvand, Golareh; Okoro, Chukwudi; Pollard, Scott; Jiang, Tengfei (March 2020, IEEE Transactions on Device and Materials Reliability)
null (Ed.)
Full Text Available

Search for: All records